On the Episode Duration Distribution in Fixed-Policy Markov Decision Processes

Authors

  • Itamar Arel
  • Andrew S. Davis
Abstract

This paper presents a formalism for determining the episode duration distribution in fixed-policy Markov decision processes (MDPs). To achieve this goal, we borrow the notion of the n-step first-visit probability from queueing theory, apply it to a Markov chain derived from the MDP, and arrive at the distribution of episode durations between any two arbitrary states. We illustrate the proposed methodology with an agent navigating a 25-state maze, demonstrating the applicability of the method.
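
To make the idea concrete, here is a minimal sketch (not the authors' code) of the computation the abstract describes: collapse the MDP under its fixed policy into a Markov chain, then run the standard first-passage recursion, censoring the goal state so probability mass is counted only at its first arrival. All function names and the toy numbers are hypothetical.

```python
import numpy as np

def induced_chain(T, policy):
    """Collapse an MDP's transition tensor T[a, i, j] under a stochastic
    policy[i, a] into the fixed-policy Markov chain
    P[i, j] = sum_a policy[i, a] * T[a, i, j]."""
    return np.einsum('ia,aij->ij', policy, T)

def first_visit_distribution(P, start, goal, max_steps=200):
    """f[n-1] = Pr(first visit to `goal` occurs at exactly step n | start at `start`),
    i.e. the episode-duration distribution between the two states."""
    f = np.zeros(max_steps)
    alive = np.zeros(P.shape[0])    # mass over paths that have not yet hit goal
    alive[start] = 1.0
    for n in range(max_steps):
        alive = alive @ P           # advance the chain one step
        f[n] = alive[goal]          # mass arriving at goal for the first time
        alive[goal] = 0.0           # censor it so it is not counted again
    return f

# Hypothetical 3-state chain (the paper's example is a 25-state maze):
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.2, 0.0, 0.8]])
f = first_visit_distribution(P, start=0, goal=2, max_steps=100)
print(f[:4])                          # Pr(duration = 1, 2, 3, 4)
print((np.arange(1, 101) * f).sum())  # approximate mean episode duration
```

The censoring step is what makes this a first-visit (rather than n-step occupancy) probability: once mass reaches the goal it is recorded and removed, so it cannot contribute to later steps.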

Similar articles

An Integrated Queuing Model for Site Selection and Inventory Storage Planning of a Distribution Center with Customer Loss Consideration

Keywords: Discrete facility location; Distribution center; Logistics; Inventory policy; Queueing theory; Markov processes

The distribution center location problem is a crucial question for logistics decision makers. The optimization of these decisions needs careful attention to the fixed facility costs, inventory costs, transportation costs and customer responsiveness. In this paper we stu...

Learning Unknown Markov Decision Processes: A Thompson Sampling Approach

We consider the problem of learning an unknown Markov Decision Process (MDP) that is weakly communicating in the infinite horizon setting. We propose a Thompson Sampling-based reinforcement learning algorithm with dynamic episodes (TSDE). At the beginning of each episode, the algorithm generates a sample from the posterior distribution over the unknown model parameters. It then follows the opti...
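
The excerpt above sketches the posterior-sampling loop in words. As a rough, hypothetical illustration of that general scheme (not the TSDE algorithm itself, whose dynamic episode-stopping rule is cut off in the excerpt), one round of posterior sampling for a tabular MDP could look like the following; all names, shapes, and the Dirichlet prior choice are assumptions:

```python
import numpy as np

def sample_model(counts, rng):
    """Draw a transition model from independent Dirichlet posteriors.
    counts[a, i, j] = prior pseudo-counts plus observed (i, a) -> j transitions."""
    A, S, _ = counts.shape
    P = np.empty((A, S, S))
    for a in range(A):
        for i in range(S):
            P[a, i] = rng.dirichlet(counts[a, i])
    return P

def greedy_policy(P, R, gamma=0.95, sweeps=500):
    """Value iteration on the sampled model; returns a greedy action per state.
    R[a, i] is the (known, for simplicity) expected reward of action a in state i."""
    V = np.zeros(P.shape[1])
    for _ in range(sweeps):
        Q = R + gamma * (P @ V)   # Q[a, i] = R[a, i] + gamma * sum_j P[a, i, j] * V[j]
        V = Q.max(axis=0)
    return Q.argmax(axis=0)

# One episode of posterior sampling (counts would then be updated with the
# transitions observed while executing the policy, and resampled):
rng = np.random.default_rng(0)
counts = np.ones((2, 4, 4))       # uniform Dirichlet(1) prior: 2 actions, 4 states
R = rng.uniform(size=(2, 4))      # toy rewards
policy = greedy_policy(sample_model(counts, rng), R)
```

TSDE's distinguishing feature, per the excerpt, is its dynamic episode lengths; deciding when to end an episode and resample is the part this sketch deliberately leaves out.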

Markov Chain Anticipation for the Online Traveling Salesman Problem by Simulated Annealing Algorithm

Arc costs are assumed to be online parameters of the network, so decisions must be made while the costs of arcs are not yet known. The policies determine the permitted nodes and arcs to traverse, and they are generally defined according to the departure nodes of the current policy nodes. In online-created tours, arc costs are not available to decision makers. The on-line traversed nodes are f...

(More) Efficient Reinforcement Learning via Posterior Sampling

Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Mark...

Reinforcement Learning of POMDPs using Spectral Methods

We propose a new reinforcement learning algorithm for partially observable Markov decision processes (POMDP) based on spectral decomposition methods. While spectral methods have been previously employed for consistent learning of (passive) latent variable models such as hidden Markov models, POMDPs are more challenging since the learner interacts with the environment and possibly changes the fu...

Publication date: 2010